Feat/v2 pipeline perf by caviri · Pull Request #38 · Imaging-Plaza/git-metadata-extractor

caviri · 2026-05-18T19:11:20Z

No description provided.

- Added new entries to .gitignore for development and internal files.
- Updated `pyproject.toml` to include `logfire` and modified dependency specifications.
- Enhanced `uv.lock` with new package versions and added `babel`, `backrefs`, and `ghp-import` packages.

…roduce `include_context_summary` parameter in `/v2/extract` to optionally include compiled context summaries in responses. Enhance `llm_critic` stage with tools for external context verification and owner provenance checks, ensuring better entity relevance assessment. Update documentation and tests to reflect these enhancements, improving overall extraction accuracy and user experience.

…rrect endpoint. This change ensures proper connectivity for the model's API interactions.

…cements. Introduce `POST /v2/extract` for body-based extraction, maintaining existing `GET /v2/extract/{full_path}` functionality. Implement `V2ExtractRequest` model for request validation and update API to handle new endpoint. Enhance documentation and tests to reflect these changes, ensuring improved extraction capabilities and user experience.

…ce the `gimie` function to manage exceptions more effectively, including specific handling for HTTP and connection errors. Update repository analysis logging to provide clearer insights on GIMIE output and analysis success. Additionally, refactor cache management to ensure robust data persistence with improved error logging for cache operations.

… devcontainer configuration to include SSH support and modify post-create commands. Introduce a new script to set the VS Code user's password at container start, enhancing security and usability.

…configuration. Add instructions for local .env setup to enhance security and usability.

…vironment. Add .env.example for environment variable configuration and update devcontainer.json to utilize docker-compose for service management, enhancing container orchestration and usability.

…improve post-create command. Add UV_CACHE_DIR to avoid root-owned cache issues and ensure proper cache directory creation during setup.

…optional DNS configuration. Update documentation to clarify environment variable setup for improved container networking and usability.

…tup. Add DNS resolver instructions to .env.example and include .uv-cache/ in .gitignore to prevent cache files from being tracked. Adjust schema paths in AGENTS.md and scripts for consistency with new directory structure.

…ship assessment and linked entities enrichment. Add main agent for fetching repository information, EPFL assessment prompts, and linked entities enrichment tools. Implement organization enrichment module for enhanced metadata analysis. Establish logging and configuration validation for robust agent management.

… warnings from agents and modules. Clean up unused files related to legacy imports across various components, streamlining the codebase for improved maintainability.

…text for improved context loading. Remove unused graph-related imports and clean up deprecated code in API and related modules, enhancing maintainability and performance.

…mponents. Clean up unused imports and environment variable checks, streamlining the codebase for improved maintainability and performance.

…apture_provider_snapshots.py, generate_mock_data.py, and related testing files to streamline the codebase and enhance maintainability.

…umentation. Update .env.example with additional environment variables and descriptions for improved clarity. Modify .gitignore to include logs and cache directories. Introduce new async job handling for the extraction process in the API, along with companion endpoints for job status retrieval. Update AGENTS.md and API reference documentation to reflect new tools and functionalities.

…etection. Introduce a new pipeline stage for determining parent-child relationships among organizations using an LLM agent. Add prompts for LLM input and output formatting, ensuring compliance with hierarchy rules. Update API to integrate the new stage and handle warnings for rejected relationships.

… Introduce `_is_link_veracity_enabled` and `_resolve_max_concurrent_agents` functions to manage environment variable settings for link verification and concurrent processing limits. Update orchestration logic to utilize these new configurations, enhancing performance and flexibility in the extraction process.

…ndle empty schema:author arrays. This function salvages repository entities by assigning a fallback owner from the reconciled graph when reconciliation results in an empty author array, addressing a known bug in the validation process. Update imports and module exports accordingly.
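The salvage step described here can be sketched roughly as follows. The function names, the node layout (`@id`, `@type`), and the rule of taking the first Person or Organization node as the fallback owner are all assumptions; the commit message does not say how the owner is selected.

```python
from typing import Any, Dict, List, Optional

# Assumed node types that may act as an owner; hypothetical for this sketch.
OWNER_TYPES = {"schema:Person", "schema:Organization"}


def pick_fallback_owner(graph: List[Dict[str, Any]]) -> Optional[str]:
    """Return the @id of the first owner-like node in the reconciled graph."""
    for node in graph:
        if node.get("@type") in OWNER_TYPES and node.get("@id"):
            return node["@id"]
    return None


def salvage_empty_authors(
    repo: Dict[str, Any], graph: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Fill an empty schema:author array with a fallback owner reference."""
    if repo.get("schema:author"):
        return repo  # authors survived reconciliation, nothing to salvage
    owner_id = pick_fallback_owner(graph)
    if owner_id is None:
        return repo  # no candidate available, leave the entity untouched
    # Return a patched copy so the caller's original entity is not mutated.
    return {**repo, "schema:author": [{"@id": owner_id}]}
```

Returning a copy rather than mutating in place makes it easy to compare the entity before and after salvage when logging the fix.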

…n. This addition enables the Qdrant vector database for enhanced data storage and retrieval capabilities within the development environment. Update the service configuration with appropriate ports, volumes, and restart policies.

…nd enhance .env.example with additional configuration options. Update documentation to reflect new environment variables for improved clarity and usability in the development environment.

…e, RenkuLab, and EPFL Graph indices. Update AGENTS.md and justfile with new indexer commands and documentation for improved clarity and usability. Modify mkdocs.yml to reflect new documentation structure and sections.

…tes and specific `/v2/*` endpoints. Update documentation to reflect new API security requirements, including the introduction of `API_TOKEN` for protected routes. Enhance `justfile` commands to include authorization headers for cache management and extraction tests. Add new `discover` and `hydrate` protocols for federated indexing, along with corresponding CLI commands and documentation.
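As a rough illustration of token-protected routes, the check below compares a presented header against `API_TOKEN`. The `Bearer` scheme, the function name `is_authorized`, and the choice to reject all requests when no token is configured are assumptions; the commit message only states that authorization headers are required.

```python
import hmac
from typing import Optional


def is_authorized(authorization_header: Optional[str], api_token: Optional[str]) -> bool:
    """Return True when the header carries the configured token.

    Assumes an `Authorization: Bearer <token>` header; the real scheme
    used by the API is not stated in the PR text.
    """
    if not api_token or not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # Constant-time comparison avoids leaking token contents via timing.
    return hmac.compare_digest(presented.strip().encode(), api_token.encode())
```

In a web framework this check would typically run as a dependency or middleware in front of the protected `/v2/*` routes.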

… GitHub Enterprise variable and add new context summary scout mode configuration. Enhance AGENTS.md and getting-started.md with updated Open Pulse Ontology version and additional details on the extraction process. Modify index.md and rag-indices.md to reflect changes in data storage layout and improve clarity on indexer configurations.

…r transient errors and add GitHub rate limit probing. Update .env.example to include configuration for maximum retry attempts. Modify API health checks to report GitHub rate limit status. Introduce validation for organization handles against GitHub API to ensure accurate entity representation.
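A bounded retry for transient errors could look something like the sketch below. The exception class, the function name, the exponential backoff schedule, and the default of three attempts are assumptions; the commit message only says that a maximum retry count is configurable.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (timeouts, 5xx, and so on)."""


def call_with_retries(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on TransientError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget exhausted, surface the last error
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so tests can record the delays instead of actually waiting.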

….0.0 Feat/open pulse ontology v2.0.0

…cle prompt

Profile of one hybrid extraction on a 50-person paper-heavy repo (deeplabcut/deeplabcut): person stage 5:06, **article stage 12:00 on a single LLM agent invocation that fired 100+ tool calls**, membership stage 6:37. Total ~25 min, dominated by serial waits and an unbounded article-agent tool-call loop.

Three independent quick wins, each tunable per deployment:

1. `max_concurrent_agents` default 3→8 in the orchestrator (and the `V2_MAX_CONCURRENT_AGENTS` env-resolver default in the API layer raised 6→8). Person/membership stages are bottlenecked on the `asyncio.Semaphore`, not the LLM provider; bumping the cap absorbs wider-fanout repos without saturating RCP.
2. New `_default_usage_limits()` in `V2LLMRuntime`: caps every agent invocation at 25 model requests + 50 tool calls via pydantic-ai's `UsageLimits`. Without a cap the article agent could keep cross-validating the same DOI across five tools indefinitely. Overridable per-call (existing kw-only signature) or globally via `V2_LLM_REQUEST_LIMIT` / `V2_LLM_TOOL_CALLS_LIMIT`. The cap turns runaway loops into a clean `LLMRuntimeError` that the per-stage runner already handles as a per-item warning.
3. Tightened the article agent system prompt with two "stop early" rules: emit immediately once a concrete DOI/title is found (no cross-validation past two sources), and emit `{}` after two consecutive empty searches instead of looping. The LLM was doing ~25 OpenAlex calls per agent invocation chasing the same paper.

Expected impact on the profiled repo: ~25 min → ~10 min wall time. No schema, API, or routing changes; the caps are tunable and use conservative defaults.
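The interaction between the concurrency cap and the per-invocation tool-call cap can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: `ToolBudget` is a hypothetical stand-in for pydantic-ai's `UsageLimits`, `fake_agent` and `run_stage` are invented names, and `LLMRuntimeError` is redefined locally so the example is self-contained.

```python
import asyncio
from typing import Dict, List, Tuple


class LLMRuntimeError(RuntimeError):
    """Local stand-in for the error the stage runner turns into a warning."""


class ToolBudget:
    """Hypothetical counter playing the role of a tool-call usage limit."""

    def __init__(self, tool_calls_limit: int = 50) -> None:
        self.tool_calls_limit = tool_calls_limit
        self.used = 0

    def charge(self) -> None:
        """Record one tool call and fail once the limit is exceeded."""
        self.used += 1
        if self.used > self.tool_calls_limit:
            raise LLMRuntimeError(
                f"tool call limit of {self.tool_calls_limit} exceeded"
            )


async def fake_agent(item: str, wanted_calls: int, budget: ToolBudget) -> str:
    """Simulate an agent that tries to make `wanted_calls` tool calls."""
    for _ in range(wanted_calls):
        budget.charge()
        await asyncio.sleep(0)  # yield to the event loop, like real I/O would
    return f"{item}: ok after {wanted_calls} tool calls"


async def run_stage(
    items: Dict[str, int],
    max_concurrent_agents: int = 8,
    tool_calls_limit: int = 50,
) -> Tuple[List[str], List[str]]:
    """Run one agent per item, bounded by a semaphore, collecting warnings."""
    semaphore = asyncio.Semaphore(max_concurrent_agents)
    results: List[str] = []
    warnings: List[str] = []

    async def run_one(item: str, wanted_calls: int) -> None:
        async with semaphore:  # at most max_concurrent_agents run at once
            try:
                budget = ToolBudget(tool_calls_limit)  # fresh cap per invocation
                results.append(await fake_agent(item, wanted_calls, budget))
            except LLMRuntimeError as exc:
                # A runaway loop becomes a per-item warning, not a stage failure.
                warnings.append(f"{item}: {exc}")

    await asyncio.gather(*(run_one(i, n) for i, n in items.items()))
    return results, warnings
```

Calling `asyncio.run(run_stage({"person-a": 3, "article-b": 120}))` lets the first item finish normally while the second is cut off at the cap and reported as a warning, which mirrors the behavior the description attributes to the per-stage runner.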

caviri added 30 commits February 23, 2026 20:25

feat(v2): implement P0-01 strict schema promotion

4e7c2c2

feat(v2): implement P0-02 agent schema promotion

e60a3ca

feat(v2): implement P0-03 test infrastructure

8fa9881

chore: Remove organization enrichment tests from the test suite

5f3d34c

feat(v2): implement P0-04 strict schema validation tests

f7cd949

test(v2): implement P0-05 agent schema valid-fixture checks

35f1751

test(v2): add P0-06 strict negative schema validation

911c3fd

feat(v2): add mock github provider fixtures and interface

7e6f4cb

feat(v2): add mock infoscience and ror providers

26be017

feat(v2): add deterministic mock dataset generator

9c8a15e

chore(v2): normalize mock generator constants to ascii

3fb2fe8

feat(v2): add mock ORCID provider for P0-08

9da90b3

feat(v2): add cross-reference validation for P0-12

5b97d65

test(v2): add red-phase golden tests for P0-13 and P0-14

2c20d23

docs(v2): advance entry task and log P0-08/P0-12/P0-14

c910e35

feat(v2): scaffold phase-1 package skeleton

48100e5

docs(v2): advance task pointer and record p1-01 validation

ee27544

feat(v2): add config module and github url classifier

1135813

docs(v2): advance phase-1 task tracker and changelog

fcd687d

feat(v2): add P1-05 response contracts

2a74c71

feat(v2): add P1-06 error models

a0a9dc1

feat(v2): implement P1-07 and P1-08 stub endpoints

54b68b3

docs(v2): advance entry task and log P1-05 to P1-08

ed2de49

feat(v2): mount v2 router in main api

62f6f6e

feat(v2): add v2 health check endpoint

0f208ad

docs(v2): advance entry task and log P1-09 P1-10

e4a8ea7

feat(v2): implement phase-2 provider interfaces and agent wrappers

56dfc87

docs(v2): advance phase entry task and log phase-2 testing

d80dda5

feat(v2): implement phase 2 tasks p2-05 through p2-09

44d5ebd

caviri and others added 28 commits March 6, 2026 10:38

fix: Update base URL for OpenAI-compatible model configurations to co…

b7d97de

…rrect endpoint. This change ensures proper connectivity for the model's API interactions.

feat(devcontainer): Add SSH feature and password setup script. Update…

151d2f6

… devcontainer configuration to include SSH support and modify post-create commands. Introduce a new script to set the VS Code user's password at container start, enhancing security and usability.

chore(env): Update .env.example to include optional devcontainer SSH …

e5547c2

…configuration. Add instructions for local .env setup to enhance security and usability.

feat(devcontainer): Introduce docker-compose setup for development en…

89989da

…vironment. Add .env.example for environment variable configuration and update devcontainer.json to utilize docker-compose for service management, enhancing container orchestration and usability.

feat(devcontainer): Update environment configuration for caching and …

03a3e27

…improve post-create command. Add UV_CACHE_DIR to avoid root-owned cache issues and ensure proper cache directory creation during setup.

feat(devcontainer): Enhance .env.example and docker-compose.yml with …

8ac08d9

…optional DNS configuration. Update documentation to clarify environment variable setup for improved container networking and usability.

refactor(v2): Remove deprecated compatibility layer and legacy import…

34294a7

… warnings from agents and modules. Clean up unused files related to legacy imports across various components, streamlining the codebase for improved maintainability.

refactor(simplifying v2): Replace JSONLDExporter with load_jsonld_con…

629c42e

…text for improved context loading. Remove unused graph-related imports and clean up deprecated code in API and related modules, enhancing maintainability and performance.

refactor(v2): Remove Logfire integration and related observability co…

4a66f0d

…mponents. Clean up unused imports and environment variable checks, streamlining the codebase for improved maintainability and performance.

refactor(v2): Remove obsolete scripts and testing utilities. Delete c…

2c9fd90

…apture_provider_snapshots.py, generate_mock_data.py, and related testing files to streamline the codebase and enhance maintainability.

feat(rag system and docker compose): Remove obsolete .env.dist file a…

cbba58e

…nd enhance .env.example with additional configuration options. Update documentation to reflect new environment variables for improved clarity and usability in the development environment.

Merge pull request #28 from Imaging-Plaza/feat/open-pulse-ontology-v2…

b8250e2

….0.0 Feat/open pulse ontology v2.0.0

Update pyproject.toml

7ec4727

caviri closed this May 18, 2026

caviri deleted the feat/v2-pipeline-perf branch May 18, 2026 19:12

caviri wants to merge 130 commits into main from feat/v2-pipeline-perf